Data Gathering


Method

In this project’s data collection phase, I will use an API to gather recently generated data, which makes it possible to analyze current trends. For data that is older or unavailable through the API, I will look for existing public datasets online and download them for use in this project.

Downloading Data

Data on Last.fm

I also needed data on user behavior to gain insight into how music is consumed. However, user data is often confidential and not disclosed to the public. To address this, I found an open dataset of Last.fm users’ listening events, which includes some anonymized personal information about the users. This dataset is the LFM-2b Dataset. It comprises three key components:

The first component consists of three TSV files.

  1. One file contains anonymized user personal information such as gender and age.
  2. The second file stores track information, including track ID and artist names.
  3. The third file contains the listening events. Each record includes the timestamp, the user who listened to the music, and the specific track they played.
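The relationship between the three files can be sketched as a join on user ID and track ID. This is only an illustration on made-up miniature data: the column names (`user_id`, `track_id`, `artist_name`, `track_name`, `timestamp`) and the helper `join_events` are assumptions, not the actual LFM-2b schema.

```python
from collections import Counter

# Hypothetical miniature versions of the three files; real column names may differ.
users = {
    1: {"gender": "f", "age": 25},
    2: {"gender": "m", "age": 31},
}
tracks = {
    10: {"artist_name": "Artist A", "track_name": "Song X"},
    11: {"artist_name": "Artist B", "track_name": "Song Y"},
}
events = [
    {"user_id": 1, "track_id": 10, "timestamp": "2020-01-01 10:00:00"},
    {"user_id": 1, "track_id": 11, "timestamp": "2020-01-01 10:04:00"},
    {"user_id": 2, "track_id": 10, "timestamp": "2020-01-02 18:30:00"},
    {"user_id": 3, "track_id": 10, "timestamp": "2020-01-03 09:15:00"},  # unknown user
]


def join_events(events, users, tracks):
    """Attach user and track attributes to each listening event."""
    joined = []
    for event in events:
        user = users.get(event["user_id"])
        track = tracks.get(event["track_id"])
        if user is None or track is None:
            continue  # skip events that reference an unknown user or track
        joined.append({**event, **user, **track})
    return joined


joined = join_events(events, users, tracks)
# Example question the joined rows can answer: listening events per gender.
plays_by_gender = Counter(row["gender"] for row in joined)
```

Each joined row carries the event, the listener's attributes, and the track's attributes, which is the shape needed for questions about who listens to what and when.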

Some of the files are quite messy and cannot be read directly with pandas because they mix separators, so I can only show samples of the datasets here.
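One way to pull a small sample out of such a file without loading all of it is to split each line on a set of candidate separators. The sketch below uses only the standard library. It assumes that stray rows use a semicolon instead of a tab, which is a guess rather than something confirmed for these files, and the column names in the demo are hypothetical.

```python
import itertools
import re

# Assumption: fields are separated by a tab, or by a semicolon in malformed rows.
SEPARATORS = re.compile(r"[\t;]")


def sample_rows(lines, n=5, separators=SEPARATORS):
    """Return the header and the first n data rows from an iterable of text lines."""
    stripped = (line.rstrip("\r\n") for line in lines)
    non_empty = (line for line in stripped if line.strip())
    first = next(non_empty, None)
    if first is None:
        return [], []
    header = separators.split(first)
    rows = []
    for line in itertools.islice(non_empty, n):
        fields = separators.split(line)
        # Pad short rows with empty strings; fields beyond the header width are dropped.
        fields = fields + [""] * (len(header) - len(fields))
        rows.append(dict(zip(header, fields)))
    return header, rows


# In-memory stand-in for a messy file; with a real file, pass the open file object.
demo = ["user_id\tgender\tage\n", "1\tm\t25\n", "2;f;31\n", "\n", "3\tn\n"]
header, rows = sample_rows(demo, n=2)
```

Because the function accepts any iterable of lines, it reads only as many lines as the sample needs, which matters for a file as large as the listening events.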

Dataset Name : users.tsv

Sample of the dataset:

Dataset Name : tracks.tsv

Sample of the dataset:

Dataset Name : listening_events.tsv

Sample of the dataset:

API Data Gathering

Music lyrics using Genius API

Another crucial aspect of this project is the song lyrics, which Spotify does not provide. To address this, I used the Genius API to obtain the lyrics data. Specifically, I used the ‘lyricsgenius’ package in Python to fetch the lyrics for each song by name. I fetched the lyrics for the songs in each dataset above, so there are three lyrics datasets that follow the same format. For brevity, I show a sample of only one of them.
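The fetching step could look roughly like the sketch below. It is not the exact script used for this project. `lyricsgenius.Genius` and `search_song` are the package's own calls, while `make_genius_search`, `collect_lyrics`, the input field names, and the optional artist argument are assumptions made for illustration. The search function is passed in as a parameter so the logic can be tried with a stub and without an API token.

```python
def make_genius_search(access_token):
    """Build a search callable backed by lyricsgenius (imported only when needed)."""
    import lyricsgenius  # third-party package: pip install lyricsgenius

    genius = lyricsgenius.Genius(access_token)

    def search(title, artist=""):
        song = genius.search_song(title, artist)  # returns None when nothing matches
        return song.lyrics if song is not None else None

    return search


def collect_lyrics(tracks, search):
    """Return rows with track_id, name and lyrics for every track that has a match."""
    rows = []
    for track in tracks:
        lyrics = search(track["name"], track.get("artist", ""))
        if lyrics:
            rows.append(
                {"track_id": track["track_id"], "name": track["name"], "lyrics": lyrics}
            )
    return rows


# Stub search used here instead of the real API; the titles are made up.
def fake_search(title, artist=""):
    return {"Song X": "la la la"}.get(title)


demo_tracks = [
    {"track_id": "t1", "name": "Song X"},
    {"track_id": "t2", "name": "Song Without Lyrics"},
]
lyrics_rows = collect_lyrics(demo_tracks, fake_search)
```

With a real token, `collect_lyrics(demo_tracks, make_genius_search(token))` would produce rows in the same three-column layout as the genius_lyrics dataset.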

Dataset Name : genius_lyrics

Basic information:

total 3 columns, 100 rows
Column      Non-Null Count   Dtype
track_id    100 non-null     object
name        100 non-null     object
lyrics      100 non-null     object

A sample of the dataset:

Audio features for Last.fm tracks

The Last.fm data described in the previous section provides valuable user information and activity data. However, the track data contains only artist and track names and no audio features, which is not enough for meaningful analysis. To fill this gap, I used the Spotify API to search for each track and retrieve its audio features. The resulting dataset is stored in a JSON file. My plan is to collect audio features for the top 10,000 songs. Collection is still in progress because the Spotify API enforces a rate limit, so gathering all the data takes time. For now, the dataset contains only part of the data, but the format is finalized.
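A common way to live with the rate limit is to request track IDs in batches and to wait whenever the API answers with HTTP 429, using the `Retry-After` header it sends back. The sketch below shows that pattern in isolation. It assumes a batch size of 100 IDs per audio-features request, and `chunked`, `call_with_retry`, and the `send` callable are hypothetical helpers, not the collection script used here.

```python
import time


def chunked(ids, size=100):
    """Yield consecutive batches of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def call_with_retry(send, max_retries=5, sleep=time.sleep):
    """Call send() until it stops returning HTTP 429, waiting as Retry-After asks.

    send() must return a (status, headers, body) tuple.
    """
    for attempt in range(max_retries + 1):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt == max_retries:
            break
        # Retry-After is given in seconds; fall back to 1 second if it is missing.
        sleep(int(headers.get("Retry-After", 1)))
    raise RuntimeError("still rate limited after %d retries" % max_retries)


# Simulated responses: one rate-limit reply, then a successful one.
responses = iter([
    (429, {"Retry-After": "2"}, None),
    (200, {}, {"audio_features": []}),
])
waits = []
status, body = call_with_retry(lambda: next(responses), sleep=waits.append)
batch_sizes = [len(batch) for batch in chunked(list(range(250)))]
```

In a real run, `send` would wrap the HTTP request for one batch, and each successful body would be appended to the JSON file so that an interrupted collection can resume where it stopped.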

Dataset Name : sample_track_info.json

Basic information:

total 19 columns, 5700 rows
Column             Non-Null Count   Dtype
danceability       5700 non-null    float64
energy             5700 non-null    float64
key                5700 non-null    int64
loudness           5700 non-null    float64
mode               5700 non-null    int64
speechiness        5700 non-null    float64
acousticness       5700 non-null    float64
instrumentalness   5700 non-null    float64
liveness           5700 non-null    float64
valence            5700 non-null    float64
tempo              5700 non-null    float64
type               5700 non-null    object
id                 5700 non-null    object
uri                5700 non-null    object
track_href         5700 non-null    object
analysis_url       5700 non-null    object
duration_ms        5700 non-null    int64
time_signature     5700 non-null    int64
track_id           5700 non-null    int64

A sample of the dataset: